Abstract

Detection of yet unknown subgroups showing differential gene or protein expression is a frequent goal in the analysis of 
modern molecular data. Applications range from cancer biology over developmental biology to toxicology. Often a control 
and an experimental group are compared, and subgroups can be characterized by differential expression for only a 
subgroup-specific set of genes or proteins. Finding such genes and corresponding patient subgroups can help in 
understanding pathological pathways, diagnosis and defining drug targets. The size of the subgroup and the type of 
differential expression determine the optimal strategy for subgroup identification. To date, commonly used software 
packages hardly provide statistical tests and methods for the detection of such subgroups. Different univariate methods for 
subgroup detection are characterized and compared, both on simulated and on real data. We present an advanced design 
for simulation studies: Data is simulated under different distributional assumptions for the expression of the subgroup, and 
performance results are compared against theoretical upper bounds. For each distribution, different degrees of deviation 
from the majority of observations are considered for the subgroup. We evaluate classical approaches as well as various new 
suggestions in the context of omics data, including outlier sum, PADGE, and kurtosis. We also propose the new FisherSum 
score. ROC curve analysis and AUC values are used to quantify the ability of the methods to distinguish between genes or 
proteins with and without certain subgroup patterns. In general, FisherSum for small subgroups and the t-test for large
subgroups achieve best results. We apply each method to a case-control study on Parkinson's disease and underline the 
biological benefit of the new method. 
Introduction

Subgroup detection is a common goal in many analyses of 
modern omics data. For gene and protein expression data, the
most frequent research task is the comparison of a control group
and an experimental group, or of healthy and diseased subjects.
Parametric methods, e.g. based on Student's t-test or moderated
t-test statistics, as well as non-parametric methods, e.g. based on
permutations as in the Wilcoxon test statistic, are commonly used.
However, in many cases the underlying assumption of these
methods, a homogeneous experimental group, is not justified. This
is especially true for cancer biology where even a clearly defined
cancer type can be associated with a subgroup structure due to
inherent biological heterogeneity. It has been shown that for
several cancer types certain oncogenes can cause heterogeneous 
expression patterns, e.g. in breast cancer [1], lung cancer [2], and 
prostate cancer [3]. In such cases the disease group might be 
decomposed into subgroups that can be characterized by 
differential expression for different sets of genes or proteins. 

For some cancer types, such patient subgroups are known to 
some extent and can be explained by different molecular subtypes 
of the cancer, or by clinical variables like tumor grade or stage. 
Moreover, the underlying aim for the ongoing development of 
subgroup detection methods is the identification of yet unknown 
subgroups that are of clinical relevance, for example since group 
membership correlates with progression or therapy response. 
Results from subgroup detection analysis might be highly relevant
to personalized medicine, e.g. for the development of new drug 
targets. Therefore, a large amount of data is collected and 
searched for variables with a distinct expression pattern: Prefer- 
ably, only the (yet unknown but) relevant patient subgroup shows 
e.g. higher expression levels in particular variables, whereas the 
expression between healthy controls and other diseased patients 
does not differ. Thus, instead of identifying biomarkers that are 
able to distinguish all observations in the experimental group from 
controls, often the focus is on the detection of single features, i.e. 
genes or proteins, that characterize disease stages or disease 
subtypes. In this case, subgroup detection approaches are based on 
univariate methods. 

In many software packages for the analysis of omics data only 
basic measures and tests for differential expression are available, 
typically including Student's t-test. This is the optimal test for the
detection of location differences between two groups of samples if
all observations per group are realizations of random variables
with the same normal distribution. However, depending on the
aim of the study, exactly those variables that do not meet the
assumption of identical distributions are the most interesting ones. 

For that reason, many univariate methods for subgroup 
detection have been proposed in the literature, especially in the 
context of omics data analysis. Depending on the interest of the 
researcher these methods can be used as stand-alone methods for 
univariate analyses or for dimensionality reduction before the
application of multivariate methods. [4] introduced COPA, a
method for cancer outlier profile analysis, which is shown to be more
powerful than the t-test in case of a small number of up-regulated
values. An important aspect of COPA is the search for pairs of
genes with mutually exclusive outlier samples. For the above
mentioned reasons, we focus on univariate methods and do not
consider COPA in our work. [5] proposed a two-step procedure
called profile analysis using clustering and kurtosis (PACK), which
consists of a preselection step to identify variables of interest and
the subsequent computation of the kurtosis to characterize the
subgroup pattern and rank the variables according to their
importance. [6] proposed the outlier sum (OS), in which first
location and variance are determined with robust measures and
then values of extreme observations are summed up to an overall
score. [7] presents a refinement, the outlier robust t-statistic (ORT), in
which location and variability are estimated only from observations
of the control group. [8] presented the percentile analysis for
differential gene expression (PADGE), a strategy that makes use of
existing statistical tests (e.g. t-test, Wilcoxon test) and applies them
iteratively to subsets of extreme observations, where the size of the
subset is decreasing. Existing subgroups will cause a characteristic
increase in the fold change when plotted against percentiles.

In previously published simulation studies, subgroup size is 
varied and for the subgroup of interest, a fixed distribution is 
considered, mostly a normal distribution with shift 2. [9]
compared the performance of OS and Student's t-test and were
the first to consider a more general, non-parametric shift
alternative. However, a comprehensive comparison of all these 
approaches is still missing. The goal of this paper is to perform a 
profound comparison of basic tests commonly used in omics 
facilities with such tests that are specifically developed for outlier or 
subgroup detection. We provide guidance in which situation which 
method is best or at least competitive. We also include a new test 
strategy that is based on the simple idea of Fisher's exact test. 

In addition to univariate statistical tests for differential analysis, 
many software solutions provide multivariate procedures such as 
principal component analysis (PCA) or hierarchical clustering.
These methods are useful for visualization of the structure of the
data and to detect global differences between groups of samples,
and we highly recommend their use to obtain a first impression of 
the data. They facilitate the examination if the most relevant 
groups (such as healthy vs. diseased or treated vs. untreated) 
separate well or if problems with the sample material or 
preprocessing steps are likely. Thus, systematic errors that result 
from batch or lab biases may become apparent. Obviously, in case 
the PCA reveals any subgroups of samples in the disease group, 
these should be analyzed further. But unfortunately, in many 
applications, there are no subgroups that stand out in resulting 
biplots. This is due to the fact that small subgroups often have only 
a small impact on the overall variation in the data, especially if the 
subgroup only differs regarding a small proportion of the variables 
in the data set or if the difference in expression levels is small. This 
has been shown before, e.g. by [10], who have illustrated by means 
of real and simulated data, that this type of subgroup does not 
necessarily separate from the remainder of observations in 
ordinary biplots. Again, basic methods like PCA are not 
appropriate for the detection of small subgroups in single variables 
of larger data sets. Precisely since each variable is assessed 
independently at first, univariate approaches permit a different
view on the data in comparison to common multivariate methods. 

We present an innovative design for simulation studies in the
context of subgroup detection. We focus on variables where
observations in the reference group (controls or healthy subjects)
and the majority of the disease group follow the same distribution
and the remaining observations form exactly one subgroup within
the disease group and show differential values. Note that the
assumption of a single subgroup per variable does not imply that 
the disease group in general incorporates only one subgroup. In 
fact, it is quite possible that there are several subgroups that reflect 
e.g. different subtypes of the disease. More details on this issue can 
be found in the discussion. In our simulation, different assumptions 
about the differential subgroup size and the corresponding 
distribution are made. We investigate different types of location
and scale shifts of the expression in the subgroup. In contrast to 
previous publications, we do not vary total sample size and the size 
of the subgroup alone, but also the type of the subgroup 
distribution as well as the degree of deviation of the subgroup. 
This means that different methods are compared over a wide 
range of alternatives in contrast to the common comparison at one 
or two alternatives, which may or may not reflect the general
performance of the methods. Furthermore, the consideration of a 
case-control design allows for the consideration of variables that 
show non-disease-specific subgroups, which is an important factor 
when it comes to real data application. However, this has been 
ignored for the most part up to now. Another advantage of our 
approach is that we can calculate an upper bound for the 
performance, since in the controlled scenario of the simulation 
study the true distributions and therefore the theoretically optimal 
likelihood ratio is available. We also apply the considered tests to 
real proteomics data from a case-control study on Parkinson's 
disease and show that the new FisherSum method generates new 
biological insights. 

The remainder of this work is organized as follows. The section 
Methods introduces all measures and tests that are taken into 
consideration for our simulation study, including the new method. 
We then describe the overall design of the Simulation study for 
subgroup detection, list the different parameter settings that were used 
with different distributional assumptions, elucidate our advanced 
study design, and explain the quality criterion we chose to 
compare the different methods. Afterwards, we present the Results 
of the simulation study and of the real data analysis. We close with 
a Discussion and an outlook.
Methods

Tests and Scores in comparison 

In this section, we introduce all methods for subgroup detection 
that are compared in the simulation study. Mostly up-regulation is
regarded as the preferred direction for biomarkers. Thus we
consider, w.l.o.g., the one-sided case of up-regulated subgroups,
with corresponding adaptations in tests and scores where possible,
e.g. the one-sided instead of the two-sided t-test. If desired, the
methods are also easily applicable to bidirectional regulations.

As standard method, we include Student's t-test, a good choice
for global shifts. In the next section, we first introduce four
methods for subgroup detection that were developed in the context
of omics data, namely OS [6], ORT [7], PADGE [8] and kurtosis
as central part of PACK [5]. Corresponding R code can be found
in File S1. Additionally, we include Bartlett's test for homoscedasticity
as possible alternative. All these methods are univariate
and corresponding scores and p-values respectively are computed
separately for each variable in the data set, independently from
observations in other variables. 

We then introduce a new score called FisherSum that combines 
ideas from the Minimum M score introduced by [11] and the outlier
sum. Finally, we describe the calculation of an upper theoretical 
bound on the performance of the tests. 

Previously presented methods for subgroup detection 

The test statistic OS sums up expression values that are
regarded as outliers according to a robust definition. The first step is
the robust standardization of the expression values per gene by
dividing the median-centered values by the MAD (median
absolute deviation). Values from the disease group D that exceed
a threshold $c_{OS}$, which is the sum of the 75 percent quantile $q_{75}$
and the interquartile range IQR per gene, are defined as outliers
and added up to the OS, i.e.

$$OS = \sum_{x_i \in D,\; x_i > c_{OS}} x_i, \qquad c_{OS} = q_{75} + IQR.$$

Large values of OS can result from a single outlier much greater
than $c_{OS}$ or from an outlier subgroup above the threshold.
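
The OS computation described above can be sketched in a few lines of Python. This is a minimal illustration and not the R code from File S1. It assumes that median, MAD, $q_{75}$ and IQR are all taken from the pooled observations of both groups, that the MAD is used without a consistency constant, and that quantiles are interpolated linearly; the function names `quantile` and `outlier_sum` are hypothetical.

```python
import statistics


def quantile(values, p):
    """Linear-interpolation quantile (assumed convention, like R's default type 7)."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def outlier_sum(controls, disease):
    """Outlier sum for one variable, up-regulated direction only."""
    pooled = list(controls) + list(disease)
    med = statistics.median(pooled)
    # Unscaled MAD of the pooled data (assumption); must be non-zero.
    mad = statistics.median(abs(x - med) for x in pooled)
    # Robust standardization: median-centered values divided by the MAD.
    z_all = [(x - med) / mad for x in pooled]
    z_disease = z_all[len(controls):]
    q25, q75 = quantile(z_all, 0.25), quantile(z_all, 0.75)
    c_os = q75 + (q75 - q25)  # threshold: 75 percent quantile plus IQR
    # Sum the standardized disease values that exceed the threshold.
    return sum(z for z in z_disease if z > c_os)
```

A variable with a constant pooled MAD of zero would need special handling, which is omitted here.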

The outlier definition that is used for ORT is presented as an
improved version of OS. The authors argue that in the calculation
of OS, both centering and scaling factors per gene are derived
from all observations, i.e. from both groups, which leads to
overestimation of variance and location in case of present
subgroups. In contrast to OS, the threshold for outlier definition
$c_{ORT}$ is derived based on the observations in the control group
only. Let $x = (x_1, \dots, x_{n_C}, x_{n_C+1}, \dots, x_{n_C+n_D})'$ denote the observed
expression values for a single variable. The median of the controls
C, $C = (x_1, \dots, x_{n_C})$, is denoted by $med^C$, and $med^D$ is defined
analogously for the disease group D, $D = (x_{n_C+1}, \dots, x_{n_C+n_D})$. $q_{75}^C$ and
$IQR^C$ denote the 75 percent quantile and the interquartile range,
respectively, of the expression values in C. The set U of outlier
disease samples contains the observations in D that exceed the
threshold $c_{ORT} = q_{75}^C + IQR^C$. Then the test statistic can be written
as

$$ORT = \frac{\sum_{x_i \in U} (x_i - med^C)}{\mathrm{median}\{\, |x_i - med^C|_{x_i \in C},\; |x_j - med^D|_{x_j \in D} \,\}}.$$

The authors showed that ORT outperforms OS in several 
situations. 
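
As a hedged illustration of the ORT definition above, the following Python sketch computes the statistic for one variable. It follows the formula as written in this section (no additional scaling constant in the denominator) and assumes linearly interpolated quantiles; `quantile` and `ort` are hypothetical names, not part of the authors' supplementary code.

```python
import statistics


def quantile(values, p):
    """Linear-interpolation quantile (assumed convention, like R's default type 7)."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def ort(controls, disease):
    """Outlier robust t-statistic for one variable, up-regulated direction."""
    med_c = statistics.median(controls)
    med_d = statistics.median(disease)
    # Threshold from the control group only: q75^C + IQR^C.
    q25, q75 = quantile(controls, 0.25), quantile(controls, 0.75)
    c_ort = q75 + (q75 - q25)
    # Set U: disease observations above the threshold.
    outliers = [x for x in disease if x > c_ort]
    # Denominator: median of absolute deviations, each group centered
    # at its own median; must be non-zero.
    scale = statistics.median(
        [abs(x - med_c) for x in controls] + [abs(x - med_d) for x in disease]
    )
    return sum(x - med_c for x in outliers) / scale
```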



PADGE is a more complicated method that conducts statistical
tests on a set of subsets of two groups. First, the user chooses a
series of percentiles $c_i$, $i = 1, \dots, T$, e.g. 80, 85 and 90 percent, which are used to
define subsets in the controls C and diseased patients D,
respectively, as follows:

$C_i$ = {observations in C that exceed the $c_i$ quantile in C},

$D_i$ analogously. The chosen statistical test, e.g. Student's t-test or
Wilcoxon's rank-sum test, is carried out on each of the T resulting
pairs of subsets, and the corresponding p-values are corrected for
multiple percentiles afterwards. Additionally, the expression ratios
$r_i$ between the subsets are calculated. Assuming that the control
group is homogeneous and the disease group contains an up-regulated
subgroup, the difference in location will become more
apparent with increasing quantiles. The authors propose a
summary score for candidate ranking that considers the corrected
p-values $p_i$ as well as the (relative) expression ratio between the
(subgroups of) samples:



: max 



where $r_0$ is the expression ratio between the groups when all
observations are included. Thus, the term $r_i/r_0$ describes the
relative variability of overexpression in the disease group. As the 
exact computational steps are not described down to the last detail 
in the corresponding publication, we use a PADGE-like score in 
our simulation study. 

PACK allows the researcher to look for variables where the 
sample falls into major subdivisions as well as variables that show 
small outlier groups. The purpose of PACK's clustering step is the
preselection of variables that are most likely to show heterogeneous
expression. We skip the initial step and directly compute the
kurtosis for each variable by using the R package e1071 [12]. This
simpler one-step version is referred to as PAK in the manuscript. 
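
For orientation, a moment-based excess kurtosis can be computed as in the following Python sketch. The study itself relies on the R package e1071; which of that package's estimator variants was used is not stated here, so the plain moment estimator $m_4/m_2^2 - 3$ below is an assumption, and `excess_kurtosis` is a hypothetical name.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2^2 - 3 of one variable.

    Positive values indicate heavy tails, as produced for example by a
    small group of outlying observations.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0
```

Applied to the pooled observations of a variable, a small up-regulated subgroup tends to raise this value, which is the property PAK exploits for ranking.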

We also included Bartlett's test for homoscedasticity (from the R
package stats, [13]) in our study, because basically each variable 
that shows significant changes in variation between the two groups 
in comparison can be potentially interesting. The test assesses if 
two groups have equal variance, under the assumption of 
normality. 

FisherSum: a new method based on the idea of Fisher's 
exact test 

The software ProtoArray Prospector (Life Technologies, Carlsbad,
California, USA) provides a test statistic that is called Minimum M
statistic. It is a rank-based method that aims at finding unknown
subgroups of patients in two-group comparisons. Basically, the
proposed procedure is equivalent to a Minimum Fisher's exact
test, where the minimum is determined from a set of p-values. In
turn, each observation is tested as possible cutpoint. Since we focus
our attention on smaller subgroups, we do not adopt the method
as it stands but instead use Fisher's exact test in a similar manner.

Generally, Fisher's exact test is used to analyze the statistical
dependence of two binary variables. We investigate the dependence
between the group membership (disease group D or controls
C) and the size of the observed values, i.e. whether the individual
values exceed a particular cutoff cut. This corresponds to a
contingency table as shown in Table 1.

Table 1. Contingency table to assess the dependence
between the group membership and the size of the observed
values.

Here, $n_{11}$ denotes the number of observations in D with
expression values above the cutoff cut. For our purpose, a
reasonable cutoff is a quantile of the values in the disease group
per variable. In the following, $d_y$ denotes the y percent quantile of
D, i.e. 10 percent of the values in D are larger than $d_{90}$. Under the
null hypothesis that there is no difference between the groups, $d_y$
should be similar to $c_y$, the corresponding y percent quantile of the
controls, and one would expect that about $100 - y$ percent of the
values in C and in D, respectively, exceed $d_y$. On the contrary, in
case of a real e.g. 10 percent subgroup in D with increased values,
no (or at least few) values in the control group should exceed the
cutpoint $d_{90}$.
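
The construction of the contingency table and a one-sided Fisher's exact test at the cutpoint $d_y$ can be sketched as follows. This is an illustrative Python version under stated assumptions: the cell labels ($n_{11}$ and $n_{21}$ for D above and below the cutoff, $n_{12}$ and $n_{22}$ for C) are inferred from the text, quantiles are interpolated linearly, and the names `quantile` and `fisher_upper_tail` are hypothetical.

```python
from math import comb


def quantile(values, p):
    """Linear-interpolation quantile (assumed convention, like R's default type 7)."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def fisher_upper_tail(controls, disease, y=0.90):
    """2x2 table at the y quantile of D and one-sided Fisher p-value.

    Returns ((n11, n12, n21, n22), p), where p is the hypergeometric
    probability of seeing at least n11 disease values above the cutoff,
    given fixed margins.
    """
    cut = quantile(disease, y)
    n11 = sum(x > cut for x in disease)    # D above the cutoff
    n12 = sum(x > cut for x in controls)   # C above the cutoff
    n_d, n_c = len(disease), len(controls)
    above = n11 + n12
    tail = sum(
        comb(n_d, i) * comb(n_c, above - i)
        for i in range(n11, min(n_d, above) + 1)
    )
    p = tail / comb(n_d + n_c, above)
    return (n11, n12, n_d - n11, n_c - n12), p
```

Because only a few distinct tables are possible for a given sample size, the resulting p-values take few distinct values, which is the source of the ties discussed in the text.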

For each variable in the data set, we choose the corresponding
quantile $d_{90}$ as cutpoint cut. Our experience is that this cutoff
yields good results on both real and simulated data compared to
other quantiles. It does not attach too much importance to single
outliers or very small groups of up-regulated observations, which are
more likely to be false positives or hard to validate, and at the same
time it also works well for subgroups larger than 10 percent. In the
latter case, only the most extreme observations are taken into
account, which corresponds to a smaller subgroup with larger
deviation from the remainder of the group. Therefore, variables
that show a global difference between the two groups will also yield
a higher FisherSum score depending on the amount of shift. In
principle, the p-values of the corresponding Fisher test (which we
call Fisher 10, see File S1) serve the purpose to rank variables
according to the potential existence of up-regulated subgroups, but
deriving the Fisher tests for each variable of a high-throughput
data set leads to large numbers of ties regarding the p-values, due
to the discrete character of the test statistic. Besides, it can yield a
number of false positive results. Thus, instead of directly using the
p-values, we propose a scoring method following the idea of OS.
Basically, after centering the observations with the median of the
controls, we sum up the values that correspond to $n_{11}$, i.e. the
values in D above the cutpoint, and subtract the sum of
observations corresponding to $n_{12}$, i.e. the values in C above the
cutpoint. In more detail, both sums are
adjusted using weights $w_D$ and $w_C$ such that our test statistic
FisherSum can be written as

$$FS = w_D \sum_{x_i \in D,\; x_i > cut} x_i \; - \; w_C \sum_{x_j \in C,\; x_j > cut} x_j.$$

The subtraction represents a penalty for variables that show up-regulated
subgroups in both control and disease group. We refer to
this pattern as non-disease-specific subgroups. A natural choice for
$w_D$ and $w_C$ is $1/n_D$ and $1/n_C$, respectively, which becomes
important in case of unbalanced designs or for the comparison of
scores across studies with different sample sizes. If only a single
study with balanced design is of interest, one can simply choose
$w_D = w_C = 1$. In general, the weights can also be used to adjust the
magnitude of the penalty for a non-disease-specific pattern if 
desired by the researcher. 

Depending on the study design, one might want to provide p-values
in addition to the actual value of the FisherSum FS to
control the type I error. As for methods like OS and ORT where
the distribution of the test statistic is not expressible in closed form,
p-values can be obtained by estimating the distribution of FS
under the null situation. For an explorative analysis and to
generate hypotheses for future studies, it might be sufficient to rank
the variables according to decreasing FS values, as it was
analogously proposed for PACK [5].
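
One possible way to estimate the null distribution of a score is to permute the group labels. The text does not specify the estimation procedure, so the permutation scheme below is an assumption, shown only as a sketch. The helper `mean_difference` is a placeholder score that makes the example runnable; any score function taking `(controls, disease)` could be passed instead.

```python
import random


def permutation_p_value(score, controls, disease, n_perm=199, seed=0):
    """One-sided permutation p-value for an arbitrary score function.

    Group labels are shuffled n_perm times; the p-value is the share of
    permuted scores at least as large as the observed one (with the
    usual +1 correction so that it is never exactly zero).
    """
    rng = random.Random(seed)
    observed = score(controls, disease)
    pooled = list(controls) + list(disease)
    n_c = len(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if score(pooled[:n_c], pooled[n_c:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def mean_difference(controls, disease):
    """Placeholder score used only to make the example runnable."""
    return sum(disease) / len(disease) - sum(controls) / len(controls)
```

Note that label permutation mimics a null situation with identical distributions in both groups; whether this is adequate for a given study design has to be judged by the analyst.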

We are well aware that Fisher's exact test assumes fixed margins 
instead of fixed entries in certain cells. In our simulation we 
compared these two methods, computing the quantile dy from the 
disease group only, which results in stochastic margins with fixed 
cell values as well as the computation of the (100 + y)/2 quantile 
from the pooled data. The first approach outperformed the 
theoretically more accurate second version in the simulation (data 
not shown) and was chosen as the new Fisher type algorithm. Due 
to our focus on up-regulated subgroups in D, we apply a slight
modification where we set negative values to zero. 
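
Putting the pieces of this section together, the FisherSum score for a single variable can be sketched as follows. This is an illustrative Python version, not the authors' R implementation from File S1. It assumes the cutpoint $d_{90}$ is a linearly interpolated quantile of the disease group, the default weights are $1/n_D$ and $1/n_C$, and "negative values" refers to negative scores; `quantile` and `fisher_sum` are hypothetical names.

```python
import statistics


def quantile(values, p):
    """Linear-interpolation quantile (assumed convention, like R's default type 7)."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def fisher_sum(controls, disease, y=0.90, w_d=None, w_c=None):
    """FisherSum score for one variable, up-regulated direction."""
    w_d = 1.0 / len(disease) if w_d is None else w_d
    w_c = 1.0 / len(controls) if w_c is None else w_c
    med_c = statistics.median(controls)   # centering with the control median
    cut = quantile(disease, y)            # cutpoint d_y from the disease group only
    # Centered values above the cutpoint, separately for D and C.
    sum_d = sum(x - med_c for x in disease if x > cut)
    sum_c = sum(x - med_c for x in controls if x > cut)
    score = w_d * sum_d - w_c * sum_c
    # Assumed reading of the final modification: negative scores become zero.
    return max(score, 0.0)
```

With identical groups that both contain one extreme value, the penalty term cancels the disease sum and the score is zero, which illustrates the intended robustness against non-disease-specific patterns.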

Simulation Study for Subgroup Detection 

A commonly assumed distribution for expression values is the
normal distribution $N(\mu, \sigma^2)$, which is meant to represent the noise
in real data. Usually, in simulation studies the values of a subgroup
are then drawn from $N(\mu + \delta, \sigma^2)$, $\delta > 0$, most often with $\delta = 2$.
While we generally agree with the assumption of normality for the 
majority of observations, we recommend to reconsider the 
assumptions regarding the dysregulated subgroup. We are 
convinced that in the biological context, a shift in means usually 
goes along with an increased variance, which makes truly existing 
subgroups even harder to detect. To our knowledge, we are the 
first to evaluate the effect of simultaneously increased mean and 
variance in the patient subgroup together with the yet ignored 
non-disease-specific patterns. Note that this simultaneous increase 
of mean and variance is due to biological reasons and relates to the 
values corresponding to the patient subgroup in comparison to the 
other observations. This biological effect appears for variables 
regardless of their location, i.e. for variables with high as well as 
with lower values. The increased variance would also apply to 
subgroups that are down-regulated in comparison to the 
remainder of the observations. It is not to be confused with the 
general effect of variances depending on the mean, which should 
be taken care of by appropriate data normalization: a log-like scale 
instead of the original scale, e.g. log-intensities in gene expression 
analysis or in fluorescence-based techniques like protein micro- 
arrays. Without proper normalization and transformation, larger 
values tend to have larger variances due to technical reasons. For 
variables with values in the lower to medium range the effects may 
be negligible. On the other hand, for variables with large absolute 
values and a truly present patient subgroup there is more variation 
to the values themselves and the resulting computed scores. 

In summary, the aims and special features of our study are as 
follows. We assess the effect of different types of distributions for 
the observations in a patient subgroup. For each distribution type, 
a range of parameters is considered, which corresponds to an 
increasing difference between the subgroup and the remainder of 
the observations. This permits a comprehensive comparison of the 
different methods. Additionally, our extended definition of the null 
situation allows the assessment of the robustness of each method 
against false positive results, where both control and disease group 
show up-regulated subgroups. This indicates a non-disease-specific 
subgroup, that might be due to unknown confounders or to single 
values that are extreme by chance, and that is not considered 
relevant for the characterization of disease subgroups. 
Design of the simulation study

Before going into details of the simulation, we first establish
some notation. Consider a two group comparison between a
control group C and a disease group D, both of sample size n. Let
D contain a patient subgroup of size $n_{SG} = q \cdot n$ with $q \in (0,1]$, and
w.l.o.g. assume $qn \in \mathbb{N}$. In clinical practice, the focus often is on
smaller subgroups with $q < 0.3$. However, for statistical considerations
and for the sake of completeness, we also include large
values of q up to $q = 1$, which corresponds to the whole disease
group. Let $d_s$ denote the underlying distribution for the patient
subgroup in scenario s, s = I, II, III, and let z denote the degree of
deviation from the remainder of observations, which follow the
standard normal distribution $N(0,1)$. Then the distribution $D_s$ in
the complete disease group D is modelled as in equation
(1), with the subgroup distribution as listed in Table 2.

$$D_s = (1 - q) \cdot N(0,1) + q \cdot d_s \qquad (1)$$

Distribution $D_I$ reflects the assumption that is made most
commonly in publications to date. $D_{II}$ is an example for the
combination of simultaneously increased mean and variance,
where $U[0,b]$ denotes the uniform distribution on the interval
$[0,b]$. We have chosen a one-parameter alternative hypothesis, due
to comparability with the other cases. Even though this
distribution is not a typical choice, the resulting distributional
patterns look much more like real data distributions in comparison
to $D_I$. Regardless of the actual distribution $d_s$ of the subgroup
($q < 1$), the variance in the disease group is increased compared to
the variance in the control group. Moreover, it seems quite
reasonable from a biological point of view that levels of e.g.
disease related proteins have a wider distribution in D. In contrast
to scenarios I and II, there is no shift in the expected means
between the groups in scenario III, and the dysregulation is two-sided.
We include this different scenario to point out that there are
even more possibly interesting alternatives. It is interesting to see
how methods mainly optimized for scenario I perform in scenario
III. For fixed sample and subgroup sizes, each of the scenarios can
be parameterized by a single parameter which will be referred to
as z. Depending on the scenario, z equals $\delta$, $b$, or $\sigma$, respectively.
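
Drawing one disease group according to equation (1) can be sketched as below. The concrete subgroup distributions are assumptions made for illustration, since the body of Table 2 is not reproduced here: scenario I is taken as $N(z, 1)$, scenario II as $U[0, z]$ and scenario III as a centered normal with standard deviation $z$. The function name `draw_disease_group` is hypothetical.

```python
import random


def draw_disease_group(n, q, scenario, z, rng):
    """Simulate n disease-group values with a subgroup of size q * n.

    The majority follows N(0, 1); the subgroup distribution depends on
    the scenario (assumed parameterization, see lead-in).
    """
    n_sg = round(q * n)  # subgroup size, q * n assumed to be an integer
    majority = [rng.gauss(0.0, 1.0) for _ in range(n - n_sg)]
    if scenario == "I":      # mean shift only
        subgroup = [rng.gauss(z, 1.0) for _ in range(n_sg)]
    elif scenario == "II":   # mean and variance increase
        subgroup = [rng.uniform(0.0, z) for _ in range(n_sg)]
    elif scenario == "III":  # variance increase only, two-sided
        subgroup = [rng.gauss(0.0, z) for _ in range(n_sg)]
    else:
        raise ValueError("scenario must be 'I', 'II' or 'III'")
    return majority + subgroup


# Example: n = 70 and q = 0.1 give a subgroup of 7 observations.
example = draw_disease_group(70, 0.1, "II", 4.0, random.Random(1))
```

The subgroup size is fixed rather than random, in line with the notation $n_{SG} = q \cdot n$ used in the text.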

To our knowledge, we are the first to consider different
distributional assumptions as well as varying subgroup and sample
sizes. Previous publications mostly varied the proportion of the
subgroup for a fixed shift in means. Our simulation study includes
all combinations of sample sizes $n = 20, 30, 50, 70, 100$ per group
and subgroup proportions $q = 0.1, 0.2, 0.3, 0.5, 0.75, 1$. Results not
shown in this paper can be found in File S1. Depending on the
omics technique, sample sizes are sometimes very low (below 10).
Even so, whenever the aim of a study is the characterization of
unknown subgroups, the data sets should contain significantly
larger numbers of samples in order to achieve reliable and
meaningful results. Hence, our study only includes sample sizes
starting from $n = 20$.

Table 2. Different subgroup distributions considered in the
simulation studies.

In our setting, we want to distinguish variables that show the
desired up-regulation only in a subgroup of the disease group D
from variables with a so-called null situation. Let $f_0$ and $f_1$ denote
the density of a variable without and with subgroup, respectively.
Variables with disease-specific subgroup pattern ($H_1$) have
densities $f_0$ and $f_1$ in control and disease group, respectively,
whereas the other variables have the same densities in C and D. In
case of no subgroups at all, $f_0$ holds true for C and D, and if
subgroups are present in both groups, all observations follow $f_1$
(according to scenario I, II, or III). These two null situations are
called $H_{0a}$ and $H_{0b}$, respectively. By expanding the definition of
the null situation, we are able to assess the robustness of the
different methods against non-disease-specific patterns ($H_{0b}$).
Results of simulation study and real data analysis will demonstrate
this advantage of our study design over the commonly used
simpler version. For reasons of comparability with previous
studies, we conduct our comparison study both with the simple
null situation $H_{0a}$ only and with the combination of the two null
distributions $H_{0a}$ and $H_{0b}$. Let $p_{H_{0a}} \in [0,1]$ be the proportion of
variables from $H_{0a}$ among the null situations. For our simulations
we chose $p_{H_{0a}} = 1/2$ in order to evaluate the impact of the new null
situation on the different methods. Per parameter setting we
simulated 1000 variables in total and obtain a data
matrix that contains variables from $H_{0a}$, $H_{0b}$ and $H_1$ with different
frequencies, which are given in Table 3.

We compare the different methods by means of a quality 
criterion that is defined analogously to the AUC (area under the 
receiver operating characteristic curve). In the clinical context, 
the AUC is mostly used to assess the performance of a binary 
classifier. For example, a binary response such as healthy vs. 
diseased is predicted using continuous variables like gene 
expression values or clinical parameters. In our study, the presence 
of the required distributional pattern for each variable is 
interpreted as a binary response, which is predicted with p-values 
or scores, respectively. In particular, we generate a data set of size 
1000 × 2n for every parameter combination according to the 
described pattern. This represents 1000 variables with 2n 
observations each (n for C and n for D), where the first 500 are 
drawn from a null situation and the last 500 from the respective 
alternative. The response vector truth thus has the form 
truth = $(H_0, \ldots, H_0, H_1, \ldots, H_1)'$. Then, the AUC is derived for 
each method separately by predicting this truth with the 
corresponding vector of p-values and scores, respectively. 
Sensitivity corresponds to the probability of correctly identifying a 
variable with a subgroup present only in D, whereas specificity 
means that variables with identical distributions for both groups C 
and D are classified correctly. We utilize the R package pROC 
[14]. 
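
To make the criterion concrete, the following Python sketch computes an AUC from a truth vector and a score vector. It is only an illustration and not the pROC implementation used in the study; the function name is hypothetical, and scores are assumed to be oriented so that larger values point towards $H_1$ (p-values would have to be negated first).

```python
def auc(scores, truth):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen H1 variable (truth == 1)
    receives a higher score than a randomly chosen H0 variable (truth == 0);
    ties count one half.
    """
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three null variables followed by three alternative variables
truth = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]
toy_auc = auc(scores, truth)
```

The quadratic loop is chosen for readability; for 500 null and 500 alternative variables it is still fast enough for a sketch.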

Note that the application of AUC does not require the 
specification of a threshold based on which the single variables 
are classified as $H_0$ and $H_1$, respectively. The choice of this cutoff 



Table 3. Structure of the simulated data matrix. 

            C        D        # Variables   Distributional pattern
$H_{0a}$    $f_0$    $f_0$    250           no subgroups at all
$H_{0b}$    $f_1$    $f_1$    250           subgroups in C and D
$H_1$       $f_0$    $f_1$    500           subgroup only in D

doi:10.1371/journal.pone.0079380.t003 

Figure 1. Comparison of different methods by means of AUC values. Deviation z from N(0,1) is given on the x-axis, AUC value for distinguishing 
variables with and without the desired subgroup pattern on the y-axis, according to simulation scenario I (mean increase, panel A), scenario II (mean and 
variance increase, panel B) and scenario III (variance increase, panel C), with group sizes $n_C = n_D = 70$ and true subgroup proportion q = 0.1. Plots 
correspond to $p_{H_{0a}} = 0.5$, see text for more details. Colors of lines correspond to upper theoretical bound (black, LR) and seven tests and scores (other 
colors, see legend). 

doi:10.1371/journal.pone.0079380.g001 



depends on the study aim, where the focus may be either on 
specificity or on sensitivity, in other words on minimizing type I 
and type II error, respectively. Depending on the subsequent steps 
of the analysis one might want to start with a larger set of 
potentially interesting variables. To provide generally applicable 
results, we chose the AUC as quality criterion, as it allows the 
simultaneous assessment of sensitivity and specificity of each test 
method. The AUC scale has been used before, see e.g. [15]. Most 
previous work on subgroup detection methods used true positive 
vs. false positive plots to illustrate results and compare different 
methods with respect to type I error. This is a convenient 
presentation only if the number of considered alternatives is small. 
As our focus is on the performance of the methods across a large 
range of alternatives (e.g. depending on the degree z of deviation 
from the null hypothesis), this kind of presentation is less 
appropriate than a comparison on the AUC scale. 

Assessing upper bounds for performance using the 
likelihood ratio 

In our simulations, we are not only able to compare different 
tests and scores regarding their performance, but we can also 
calculate the theoretically best performance achieved by the 
likelihood ratio (LR). It can reveal for which scenario or parameter 
range an improvement of the current methods would be most 
worthwhile. The LR is derived as the ratio of $L_1$, the likelihood based 
on the desired distributional pattern, and $L_0$, the likelihood 
assuming a null situation. According to the notation introduced 
above, LR can be derived as 



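\[
\mathrm{LR} = \frac{L_1}{L_0}
= \frac{\prod_{C} f_0 \cdot \prod_{D} f_1}
       {p_{H_{0a}} \prod_{C} f_0 \prod_{D} f_0 + (1 - p_{H_{0a}}) \prod_{C} f_1 \prod_{D} f_1}
\]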



Note that LR must outperform the other methods in all our 
simulations, but it is not applicable to real data in this exact manner, 
since true subgroup size and subgroup distribution are not known for 
real data scenarios. However, in the simulation study the densities $f_0$ 
and $f_1$ are defined by equation (1), and with all parameters fixed, the 
densities of the mixed distributions can be evaluated easily. The 
parameter $p_{H_{0a}}$, which characterizes the composition of the null 
situation, is chosen by the researcher as well. Altogether, the derivation 
of LR is straightforward in the simulation study. 
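
As an illustration of how the LR could be evaluated for one simulated variable, consider the following Python sketch. It rests on assumptions that are not taken from equation (1): $f_0$ is assumed to be the N(0,1) density, $f_1$ is assumed to be the mixture density (1 - q) N(0,1) + q N(z,1) with 0 < q < 1, and the function names are hypothetical. The computation is done on the log scale to avoid numerical underflow of the products.

```python
import math

def log_norm_pdf(x, mean=0.0, sd=1.0):
    """Log density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def log_f0(x):
    """Assumed null density: N(0, 1)."""
    return log_norm_pdf(x)

def log_f1(x, q, z):
    """Assumed subgroup mixture density: (1 - q) N(0, 1) + q N(z, 1), 0 < q < 1."""
    a = math.log(1 - q) + log_norm_pdf(x)
    b = math.log(q) + log_norm_pdf(x, mean=z)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_lr(c, d, q, z, p_h0a):
    """Log likelihood ratio for control observations c and disease observations d."""
    c0 = sum(log_f0(x) for x in c)
    d0 = sum(log_f0(x) for x in d)
    c1 = sum(log_f1(x, q, z) for x in c)
    d1 = sum(log_f1(x, q, z) for x in d)
    log_num = c0 + d1  # H1: no subgroup in C, subgroup in D
    terms = []
    if p_h0a > 0:
        terms.append(math.log(p_h0a) + c0 + d0)      # H0a: no subgroup at all
    if p_h0a < 1:
        terms.append(math.log(1 - p_h0a) + c1 + d1)  # H0b: subgroup in C and D
    m = max(terms)
    log_den = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_num - log_den
```

For example, under these assumptions a variable with a clear outlier only in D obtains a larger log LR than a variable without outliers, and a variable with outliers in both groups obtains a negative log LR when the composite null situation is used.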

Results 

We first present the results of an extensive simulation study. 
Then, a real data analysis is provided for proteomics measure- 
ments from a study on Parkinson's disease. This study shows that 
interesting patient subgroups and corresponding variables can be 
detected with the appropriate tests. 

Simulation study 

We illustrate the results of the simulation study in two steps. 
First, we consider a specific combination of sample size n and 
subgroup proportion q under various distributional assumptions, 
see Figure 1. We discuss the case of a 10 percent subgroup in a 70 
versus 70 comparison, corresponding to the real data set that is 
analyzed afterwards. Our focus is on the advanced simulation 
design that incorporates two patterns for the null situation. 
Second, we describe the influence of variations in the parameters n 
and q, when the scenario is fixed (see File S1, section 1.1 for the 
composite and section 1.2 for the simple null situation). 

The following results correspond to the simulation design with 
the composite null situation and $p_{H_{0a}} = 0.5$. In Figure 1, each 
scenario is summarized by a single plot that depicts the AUC of 
the different methods depending on the degree z of deviation of 
the subgroup's distribution from N(0,1). Thus, for each point z the 
methods could be compared in more detail regarding false positive 
and false negative rates by comparing the corresponding ROC 
curves (see also Figure 2). Again, we focus on the dependency of 
AUC values on z, to which to date little attention has been paid. 
As can be seen from Figure 1, our new method FisherSum 
outperforms the other methods considerably in the case of up- 
regulated subgroups (scenarios I and II). The next best methods 
are the t-test and Bartlett's test, where the former attains higher 
AUC values for smaller deviations while the latter performs 
second best for larger values of z. Note that AUC values for 
methods that do not check for outlier observations in the control 

[Figure 2: three ROC panels (Scenario I, Scenario II, Scenario III); x-axis: specificity.]

Figure 2. Comparison of different methods by means of ROC curves. Three panels correspond to panels in Figure 1, i.e. scenarios I, II, III with 
$p_{H_{0a}} = 0.5$ from left to right, see text for more details. The curves correspond to the z values 2, 3, and 3 for scenarios I, II, and III, respectively. Colors of lines correspond to 
upper theoretical bound (black, LR) and seven tests and scores (other colors, see legend). 
doi:10.1371/journal.pone.0079380.g002 



group, namely OS and ORT, do not necessarily converge to 1. 
Overall, the performance of all methods is quite similar between 
scenarios I and II, and the order of the tests is basically the same. 

The essential difference of scenario III compared to scenarios I 
and II is that the deregulation is bidirectional and does not cause a 
shift in the theoretical mean. Hence, AUC values from the t-test 
are close to 0.5, and PADGE works only slightly better. From 
Figure 1 we also see that the gap between the optimal LR and the 
best method is much larger in this scenario, i.e. other tests are 
required for this kind of subgroup pattern. Except for very small 
deviations, Bartlett's test for homoscedasticity performs best. 
FisherSum is second best and distinctly better than the group of 
OS, ORT and kurtosis. Basically, FisherSum considers variables in 
scenario III as variables with an up-regulated subgroup of size 
nq/2. 

The most interesting numbers regarding the parameter 
combination $(n, q, p_{H_{0a}}) = (70, 0.1, 0.5)$ are summarized in Table 4. 

To give an impression of how the type I errors of the different 
methods compare, we present a generic set of ROC curves for 

Table 4. Comparison of methods for subgroup detection 
with respect to AUC values, for group size $n_C = n_D = 70$ and 
subgroup size q = 0.1, $p_{H_{0a}} = 0.5$, corresponding to Figure 1. 

Criterion       AUC(LR) ≥ 0.95            AUC(Method) = 0.95
Scenario        I       II      III       I       II      III
LR (optimum)    0.96    0.95    0.96      2.0     3.63    2.93
FisherSum       0.89    0.87    0.72      2.6     5.28    >6
ORT             0.79    0.77    0.68              ∞
OS              0.70    0.72    0.66      ∞       ∞       ∞
t-test          0.79    0.77    0.49      4.2     9.24
PADGE           0.75    0.74    0.55      6.6     >10     ∞
Kurtosis        0.60    0.62    0.66      4.6     >10     >15
Bartlett        0.70    0.72    0.80      3.2     6.60    5.33

Left three columns contain AUC values when tests are compared at the z value with 
AUC = 0.95 for LR. Right three columns correspond to z values with AUC = 0.95 
for each test. Best values printed in bold per simulation scenario. 
doi:10.1371/journal.pone.0079380.t004 



each scenario and the combination $(n, q, p_{H_{0a}}) = (70, 0.1, 0.5)$ in 
Figure 2. Each panel in Figure 2 corresponds to one single value of 
z in one of the three panels in Figure 1. Since we want to 
summarize the method comparison across a range of alternatives, 
we prefer to present the results on the AUC scale. 

For other combinations of group size n and subgroup 
proportion q, similar statements hold true. We summarize the 
findings in the next paragraphs. Corresponding plots can be found 
in File S1, section 1.1. For small to moderate subgroup sizes up to 
about 20-30 percent, the new method FisherSum outperforms the 
existing methods in scenarios I and II for all considered sample 
sizes in the middle range of deviations. For large subgroups of at 
least 50 percent, in scenarios I and II, the t-test performs best, 
followed by FisherSum, ORT and PADGE, which yield quite 
similar results. For small deviations, these scores achieve values 
distinctly above 0.5, whereas kurtosis, OS, and Bartlett's test 
perform worse. 

In case of a composite null situation with $p_{H_{0a}} = 0.5$, simulation 
results point out the major drawback of OS and ORT. For 
moderate subgroup sizes around 20 percent, they seem to work 
well for small deviations but then AUC values decrease, with an 
asymptotic value of 0.75 (in general $0.5 + p_{H_{0a}}/2$). Apart from the 
fact that AUC values of kurtosis converge to 1 for small subgroups, 
it appears that generally other methods should be favored. 
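
One way to see where such a limit can come from is the following heuristic, which is a sketch under two assumptions and not a formal result of this study: for large deviations z, a score that ignores the control group (i) separates $H_1$ variables perfectly from $H_{0a}$ variables and (ii) cannot distinguish $H_1$ variables from $H_{0b}$ variables at all. Writing $S_{H_1}$ and $S_{H_0}$ for the scores of a randomly chosen alternative and null variable (notation introduced here for illustration only), the AUC then tends to

\[
\mathrm{AUC} = P(S_{H_1} > S_{H_0})
\;\longrightarrow\;
p_{H_{0a}} \cdot 1 + (1 - p_{H_{0a}}) \cdot \tfrac{1}{2}
= \tfrac{1}{2} + \tfrac{p_{H_{0a}}}{2},
\]

which gives 0.75 for $p_{H_{0a}} = 0.5$ and 1 for $p_{H_{0a}} = 1$.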

We point out that the kurtosis allows for the detection of two 
kinds of alternatives. On the one hand, large positive values 
indicate the existence of smaller subgroups, which we are 
interested in. On the other hand, if the size of the subgroup is 
about 50 percent, then large negative values are observed. As we 
analyze positive kurtosis as measure for subgroup detection, 
simulation results with larger subgroups show that the corre- 
sponding AUC values become much smaller than 0.5. Switching 
the group assignment in the calculation of the ROC curve would 
mirror the corresponding curves horizontally at AUC = 0.5. 
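
The sign change of the kurtosis can be checked with a small Python sketch. It is an illustration under an assumed model, not part of the study: the observations are assumed to follow the mixture (1 - q) N(0,1) + q N(z,1), for which the theoretical excess kurtosis (kurtosis minus 3) can be computed from the central moments; the function name is hypothetical.

```python
def mixture_excess_kurtosis(q, z):
    """Theoretical excess kurtosis of (1 - q) N(0, 1) + q N(z, 1)."""
    mean = q * z
    d0, d1 = -mean, z - mean  # distances of the component means to the overall mean
    # For a N(m, 1) component with d = m - mean:
    # E[(X - mean)^2] = 1 + d^2 and E[(X - mean)^4] = 3 + 6 d^2 + d^4
    var = (1 - q) * (1 + d0 ** 2) + q * (1 + d1 ** 2)
    m4 = ((1 - q) * (3 + 6 * d0 ** 2 + d0 ** 4)
          + q * (3 + 6 * d1 ** 2 + d1 ** 4))
    return m4 / var ** 2 - 3

small_subgroup = mixture_excess_kurtosis(0.1, 3.0)  # 10 percent subgroup
half_subgroup = mixture_excess_kurtosis(0.5, 3.0)   # 50 percent subgroup
```

Under this assumed model the value is positive for the 10 percent subgroup and negative for the 50 percent subgroup, in line with the two kinds of alternatives described above.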

In scenario III, in contrast to scenarios I and II, Bartlett's test 
performs best for virtually each combination of sample size and 
subgroup size. Only for smaller values of n, q, and $\sigma$ does FisherSum 
perform slightly better. Notably, even the gap between the 
optimal LR and the Bartlett test for homoscedasticity is quite large. 

Additionally, we conducted the whole simulation study again 
with $p_{H_{0a}} = 1$, which corresponds to the commonly used design. In 
the absence of non-disease-specific subgroups, OS and ORT perform 
much better and AUC values converge to 1 as desired. While there 
are some more differences concerning the other tests, the overall 
results are similar when it comes to best performances. 

Application to real data 

In this section we assess the performance of the methods on a 
real data set. Whereas the task in the simulation study is to 
distinguish between the (composed) null situation and a single type 
of alternative, there is a mixture of several alternative patterns in 
the real data set, that may also include global shifts or single 
outliers for example. As can already be seen from the simulations, 
different methods detect different patterns well. 

We analyze a subset of the data from the ParkCHIP project. 
The data have been described before, see [16], and are available at 
www.medizinisches-proteom-center.de/Ahrens_et_al. Samples 
from 72 patients with Parkinson's disease (PD) were compared 
to 72 samples from age and gender-matched healthy controls. The 
protein microarray used in this study provides about 9500 
variables. As PD is known to be a heterogeneous disease (e.g. 
[17]), we expect that the methods detect interesting variables. 

We applied all seven methods studied above to this data set and 
inspected the empirical distributions of the top ranked variables 
according to the obtained p-values and scores, respectively. File S1 
contains the corresponding plots for the top 15 candidates for all 
methods (File S1, section 1.3). In summary, only FisherSum, 
Fisher10 and PADGE find a large number of variables with the 
desired pattern, where only a subgroup of observations in the 
disease group is up-regulated. FisherSum favors variables with 
wider distributions, as was expected because FS only centers but 
does not scale the observations. In contrast, among the top ranked 
variables of Fisher10 and PADGE, there are also candidates with 
small variation and only slight increases in the subgroup. Although 
even slight increases may have a biological relevance, they are 
often considered to be false positive results. Most candidates 
chosen by OS and ORT show small non-disease-specific 
subgroups. Kurtosis and Bartlett's test primarily detect variables 
with a single or very few outliers and thus are not suitable for our 
task either. 

Next, we focus on the results of FisherSum and compare it to 
the t-test. In Figure 3 (panel A) we plot the ($-\log_{10}$) p-value of the 
t-test against FS for each variable of the data set. The measures 
show moderate correlation (Pearson correlation r = 0.67), but 
especially the extremes are quite different and FS reveals some 
interesting additional variables. We picked three variables from 
each of the following categories: largest FS, smallest p-value in the 
t-test, highly ranked in FS but poorly ranked in the t-test, and vice 
versa. The corresponding ($\log_2$)-intensity plots are shown in 
Figure 3 (panel B). 

Finally, we searched the literature for the top candidates found 
by FisherSum, see Table 5. Our method finds candidates that have 
been associated with Parkinson's disease or neurodegenerative 
diseases in general as well as yet unknown candidates. We 
conclude that FisherSum is able to detect biologically relevant 
subgroups. Note that the two top-ranked variables are both 
assigned to the PALM2 gene with basically the same potential 
subgroup identified. The same holds true for the PDPK1 gene, 
with ranks 6 and 19 (not shown here). Hence presumably these 
candidates are true positives. 

Discussion 

We compared various tests and measures for subgroup detection 
on simulated and real data. We have shown that our FisherSum 
method outperforms existing methods over a large range of sample 
sizes if the subgroup size is small or moderate (up to about 25 
percent). This limitation in subgroup size is in line with our initial 
aim to detect smaller subgroups. We use the 90 percent quantile in 
the disease group as cutpoint to construct a contingency table 
(Table 1) as it is used for Fisher's exact test. This corresponds to a 
minimal subgroup size of interest equal to 10 percent and also 
yields good results for moderately larger subgroup sizes. 
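
The cutpoint and contingency-table step can be sketched in Python as follows. This is only an illustration of that single step, not the FisherSum score itself, whose definition (centering of the observations and the weights $w_C$ and $w_D$) is given earlier in the paper and is not reproduced here. The function names, the quantile convention, the table orientation, and the use of a one-sided test are assumptions made for this sketch.

```python
from math import comb

def quantile(values, prob):
    """Empirical quantile with linear interpolation (one of several conventions)."""
    xs = sorted(values)
    pos = prob * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def fisher_step(control, disease, prob=0.9):
    """Cutpoint at the prob-quantile of the disease group, 2x2 table, one-sided p-value."""
    cut = quantile(disease, prob)
    a = sum(x > cut for x in disease)  # disease observations above the cutpoint
    b = sum(x > cut for x in control)  # control observations above the cutpoint
    n_d, n_c = len(disease), len(control)
    above, total = a + b, n_d + n_c
    # One-sided Fisher's exact test: hypergeometric probability of observing
    # at least a disease observations among those above the cutpoint
    p = sum(comb(above, k) * comb(total - above, n_d - k)
            for k in range(a, min(above, n_d) + 1)) / comb(total, n_d)
    table = [[a, n_d - a], [b, n_c - b]]
    return cut, table, p
```

Samples of the disease group above the returned cutpoint would be the natural candidates for the potential subgroup of a variable.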

It has been shown previously that the best strategy for subgroup 
detection may depend on the combination of sample and 
subgroup size as well as on the true degree of deviation from the 
null situation with no differential expression. To date, the 
influence of the underlying distribution of the subgroup has been 
ignored for the most part. By comparing the commonly assumed 
distribution with two other scenarios, we point out that the true 
underlying distribution does have an effect on the performance of 
the methods that were included in our simulation study. However, 
having a closer look at the results of scenarios I and II, we do not 
notice a fundamental difference between the test performances. In 
our opinion, the widely assumed subgroup distribution $N(\delta, \sigma^2)$ is 
too simple, because of the fairly good separation of the subgroup 
from the remainder of the disease group for larger values of $\delta$. On 
the other hand, the distribution for scenario II makes it quite hard 
to detect the true subgroup, as it always overlaps with the other 
observations. Due to the similar results for these two extreme 
situations, we expect comparable conclusions for mixtures of these 
scenarios. We included scenario III to demonstrate that the class of 
subgroup problems is not limited to location shift problems. To 
our knowledge, an increased variance in a patient subgroup has 
not been discussed before. If non-specific subgroups are included 
in the null situation, none of the compared methods shows good 
performance for small subgroups when compared to LR. This 
underlines that novel methods are required if new classes of 
subgroup distributions are considered, because the commonly used 
subgroup detection methods might fail. 

The aim of our simulation study was to provide an overall 
comparison of several methods for the detection of patient 
subgroups. Therefore, we chose three different distributions for 
the subgroup and within these scenarios we gradually increased 
the degree of deviation from the null situation. Thus, we are able 
to check for a uniformly best method or assess the performance of 
competing methods across a certain range of alternatives rather 
than focusing on a single parameter value. For example, in our 
simulation, comparing the t-test and Bartlett's test for 
homoscedasticity, the t-test performs better for small deviations from the 
null situation, whereas Bartlett's test is better for larger deviations 
for certain combinations of sample and subgroup size. In short, the 
two major advantages of our study design arise from the 
consideration of a wide range of alternatives for several 
distributions as well as from the incorporation of a new type of 
null situation, where subgroups are present in both groups. Results 
of simulation and real data analysis show that our new method is 
less likely to yield false positive results in terms of these non- 
disease-specific variables. 

Since our data is drawn from normal distributions with variance 
equal to one, the relevant values of the deviation (e.g. the required 
z value to attain a certain AUC value, or the value z where one test 
outperforms another one) might not be appropriate for real data. 
In cases of doubt, we recommend adapting this study with 
variances that fit a specific application. Differences in the amount 
of variance in the data may be due to different biological variances 
or different technical variances that correspond to different omics 
techniques. To generate data sets that are as close to real data as 
possible, variances might be drawn from empirical distributions of 

Figure 3. Comparison of FisherSum and t-test on real data. Panel A: Score of t-test ($-\log_{10}(p)$) plotted against FisherSum, applied to ParkCHIP 
data [16] using weights $w_C = w_D = 1$. Each point represents a variable in the data set and the two corresponding measures. Highlighted are three 
variables for each of the following categories: highest score for FisherSum, smallest p-value for t-test, highly ranked in FisherSum but not in t-test, and 
vice versa. Panel B: $\log_2$-intensity plots corresponding to variables highlighted in panel A. Observations on the left-hand side of each plot represent 
the control group, and the right-hand side corresponds to patients with Parkinson's disease. To point out the differences in variation, we used the 
same scale for all variables. 
doi:10.1371/journal.pone.0079380.g003 



variances from real data sets measured with the technique of 
interest. 

At this point, we did not include multivariate methods that are 
commonly applied in the field of statistics. A univariate approach 
is appropriate, if a study either aims explicitly at the detection of 
univariate biomarker candidates or if dimensionality reduction is 
required. Candidates of univariate analyses may be combined by 
multivariate procedures like hierarchical clustering afterwards. 
Note that due to the different selection criteria the results will differ 
from clustering results using the common approach. For the 
detection of small subgroups that are present in a small proportion 
of the observed variables, specific approaches are required. An 
important next step is to derive an appropriate method for the 
combination of subgroup indicating variables. 



Table 5. FisherSum's top-ranked variables for ParkCHIP data 
[16]. 

FisherSum                        t-test
Rank    FS      Description      $p_t$    Rank    Reference
1       25.7    PALM2            0.027    141     [18]
2       25.4    PALM2            0.010    42      [18]
3       24.9    MTHFR            0.037    217     [19]
4       24.9    GSK3A            0.001    2
5       24.9    PPP1R2P9         0.017    90
13      18.7    CALB2            0.083    636     [20]

For the above mentioned reasons, we chose $w_D = w_C = 1$. In the literature, some 
of them have already been associated with neurodegenerative diseases, in 
general or in particular with PD. According to p-values of the t-test (no 
adjustment for multiple testing), some of these candidates would have been 
missed. 

doi:10.1371/journal.pone.0079380.t005 



As already stated in the beginning, the assumption of at most 
one single subgroup per variable does not exclude the existence of 
several subgroups in the disease group regarding the multivariate 
dataset. The distributional pattern that we use in the simulation 
study only requires that if the disease group is composed of more 
than one subgroup, then the sets of affected variables are disjoint. 
For methods presented previously in the literature as well as for 
FisherSum, the subgroup structure may be assessed in more detail 
by an additional subsequent analysis. We add a short remark on the 
effect of multiple subgroups that are present in the same variable. 
In this case one has to differentiate further according to the 
direction of the regulation in the subgroups. If the subgroup 
regulations show the same direction, e.g. up-regulation, the 
variable is even easier to detect for the majority of the presented 
methods. On the other hand, if a variable shows at least one up- 
regulated and at the same time at least one down-regulated 
subgroup, the performance of the different methods varies. 
FisherSum for example is not affected in this case, while outlier 
sum could perform worse due to increased variation in the 
variable. In this context, an alternative interpretation of the third 
scenario with increased variance in the subgroup is possible. 
Consider the subgroup observations above and below the mean as 
two different subgroups. One of them is up-regulated, the other 
one down-regulated, with the simplification that the degree of 
deviation is equal for both subgroups. The actual size of each 
subgroup would equal nq/2. 

The presented methods are appropriate for explorative analyses 
and for the generation of new hypotheses on potential patient 
subgroups. Thus, the resulting candidate variables and the 
respective marked patient subgroups require further biological 
validation. Apart from correlations within the data sets one can try 
to correlate the potentially subgroup indicating variables with 
clinical variables such as survival times to check for their relevance. 
The explicit determination of the potential patient subgroup is of 
particular interest, but it is provided directly only for OS and 
ORT. Obviously, for FisherSum, it is a natural choice to assign the 
samples above the chosen cutoff to the potential subgroup. This 
will work well if the true subgroup size is about 10 percent. 
Another way to identify subgroups based on the set of potentially 
interesting variables is to conduct hierarchical clustering with the 
top (e.g. 50) candidates of FisherSum. Assuming that a small 
number of variables indicate the same subgroup, an existing 
subgroup structure is expected to become apparent. 

The presented results for our new method are promising as we 
are able to detect disease-related variables in real data. One option 
for future research is the introduction of scoring methods that 
allow for the separate optimization of sensitivity and specificity, 
respectively. There are several ways to adjust our method to the 
specific characteristics of a data set, for example by choosing a 
different quantile, by varying the weights $w_C$ and $w_D$, by defining 
thresholds for p-values of Fisher's exact test, or by incorporating 
the distance between the potential subgroup and the remainder of 
the group. An important goal for the future is to evaluate different 
approaches for the combination of subgroup indicating variables 
in order to gain insight into the overall patient subgroup structure. 

Supporting Information 

File S1. Figure S1, In-depth comparison of different 
methods' AUC values for the composite null situation. 
The study includes all combinations of sample sizes 
n = 20, 30, 50, 70, 100, subgroup proportions 
q = 0.1, 0.2, 0.3, 0.5, 0.75, 1, and scenarios s = I, II, III with 
$p_{H_{0a}} = 0.5$. AUC values are plotted against the deviation z. 
Figure S2, In-depth comparison of different methods' 
AUC values for the simple null situation. The study 
includes all combinations of sample sizes n = 20, 30, 50, 70, 100, 
subgroup proportions q = 0.1, 0.2, 0.3, 0.5, 0.75, 1, and scenarios 
s = I, II, III with $p_{H_{0a}} = 1$. AUC values are plotted against the 
deviation z. Figure S3, ($\log_2$-)intensity plots of the top 
candidates for ParkCHIP data. To compare the considered 
methods, the respective top 15 candidates of each method are 
shown in ($\log_2$-)intensity plots. We used controls and Parkinson 
patients from the ParkCHIP project. Table S1, Overview 
scenario I, composite null situation. Best tests for all 
combinations of sample size n and subgroup proportions 
q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 in scenario I, $p_{H_{0a}} = 0.5$, neatly arranged 
in a color-coded table. Table S2, Overview scenario II, 
composite null situation. Best tests for all combinations of 
sample size n and subgroup proportions q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 
in scenario II, $p_{H_{0a}} = 0.5$, neatly arranged in a color-coded table. 
Table S3, Overview scenario III, composite null situation. 
Best tests for all combinations of sample size n and subgroup 
proportions q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 in scenario III, $p_{H_{0a}} = 0.5$, 
neatly arranged in a color-coded table. Table S4, Overview 
scenario I, simple null situation. Best tests for all 
combinations of sample size n and subgroup proportions 
q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 in scenario I, $p_{H_{0a}} = 1$, neatly arranged 
in a single table. Table S5, Overview scenario II, simple 
null situation. Best tests for all combinations of sample size n 
and subgroup proportions q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 in scenario II, 
$p_{H_{0a}} = 1$, neatly arranged in a single table. Table S6, Overview 
scenario III, simple null situation. Best tests for all 
combinations of sample size n and subgroup proportions 
q = 0.1, 0.2, 0.3, 0.5, 0.75, 1 in scenario III, $p_{H_{0a}} = 1$, neatly arranged 
in a single table. 
(PDF) 